Learning predictive cognitive maps with spiking neurons during behavior and replays

The hippocampus has been proposed to encode environments using a representation that contains predictive information about likely future states, called the successor representation. However, it is not clear how such a representation could be learned in the hippocampal circuit. Here, we propose a plasticity rule that can learn this predictive map of the environment using a spiking neural network. We connect this biologically plausible plasticity rule to reinforcement learning, mathematically and numerically showing that it implements the TD-lambda algorithm. By spanning these different levels, we show how our framework naturally encompasses behavioral activity and replays, smoothly moving from rate to temporal coding, and allows learning over behavioral timescales with a plasticity rule acting on a timescale of milliseconds. We discuss how biological parameters such as dwelling times at states, neuronal firing rates and neuromodulation relate to the delay discounting parameter of the TD algorithm, and how they influence the learned representation. We also find that, in agreement with psychological studies and contrary to reinforcement learning theory, the discount factor decreases hyperbolically with time. Finally, our framework suggests a role for replays, in both aiding learning in novel environments and finding shortcut trajectories that were not experienced during behavior, in agreement with experimental data.


Introduction
In the mid-twentieth century, Tolman proposed the concept of cognitive maps (Tolman, 1948). These maps are abstract mental models of an environment which are helpful when learning tasks and in decision making. Since the discovery of hippocampal place cells, cells that are activated only in specific locations of an environment, it is believed that the hippocampus can provide the substrate to encode such cognitive maps (O'Keefe and Dostrovsky, 1971; O'Keefe and Nadel, 1978). More evidence of the role of the hippocampus in behavior was found in numerous experimental studies, such as the seminal water maze experiments (Morris, 1981; Morris et al., 1982), radial arm maze experiments (Olton

The successor representation
In this section, we will give an overview of the successor representation and its properties, especially geared toward neuroscientists. Readers already familiar with this representation may safely move to the next section.
To understand the concept of successor representation (SR), we can consider a spatial environment, such as a maze, while an animal explores this environment. In this setting, the SR can be understood as how likely it is for the animal to visit a future location starting from its current position. We further assume the maze to be formed out of a discrete number of states. Then, the SR can be more formally described by a matrix with dimension (N_{states} × N_{states}), where N_{states} denotes the number of states in the environment and each entry R_{ij} of this matrix describes the expected future occupancy of a state S_j when the current state is S_i. In other words, starting from S_i, the more likely it is for the animal to reach the location associated with state S_j, and the nearer in the future, the higher the value of R_{ij}.
As a first example, we consider an animal running through a linear track. We assume the animal runs at a constant speed and always travels in the same direction, from left to right (Figure 1a). We also split the track into four sections or states, S_1 to S_4, so the SR is represented by a matrix with dimension (4 × 4). Since the animal always runs from left to right, there is zero probability of finding the animal at position i if its current position is greater than i. Therefore, the lower triangle of the successor matrix is equal to zero (Figure 1b). Conversely, if the animal is currently at position S_1, it will subsequently be found at positions S_2, S_3, and S_4 with probability 1. The further away from S_1, the longer it will take the animal to reach that other position. In terms of the successor matrix, we apply a discounting factor γ (0 < γ ≤ 1) for each extra 'step' required by the animal to reach a respective location (Figure 1b).
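The linear-track example can be made concrete with a short numerical sketch. It assumes the common convention in which the SR is the discounted sum of powers of the transition matrix, so that the current state counts with weight 1 and a state n steps ahead counts with weight γ^n. The names `successor_matrix`, `P` and `M` are illustrative and do not come from the article.

```python
import numpy as np

def successor_matrix(P, gamma):
    """Closed-form SR, sum_k gamma^k P^k = (I - gamma P)^-1.

    Assumes the sum converges (gamma < 1, or a chain that always terminates).
    """
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

# Deterministic left-to-right policy on four states.
# The last state is treated as terminal (no outgoing transition): an assumption.
P = np.zeros((4, 4))
for i in range(3):
    P[i, i + 1] = 1.0

gamma = 0.9
M = successor_matrix(P, gamma)
print(np.round(M, 3))
```

Under these assumptions the lower triangle of `M` is zero and each row decays geometrically to the right, matching the verbal description of the successor matrix for this track.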
Even though we introduced the linear track as an illustrative example, the SR can be learned in any environment (see Figure 1-figure supplement 1 for an example in an open field). Note that the representation learned by the SR is dependent not only on the structure of the environment, but also on the policy (or strategy) used by the animal to explore the environment. This is because the successor representation is not purely concerned with the physical distance between two areas in the environment, but rather it measures how long it usually takes to reach one place when starting from the other. In this first example, the animal applied a deterministic policy (always running from left to right), but the SR can also be learned for stochastic policies. Furthermore, the SR is a multi-step representation, in the sense that it stores predictive information of multiple steps ahead.
Because of this predictive information, the SR allows sample-efficient re-learning when the reward location is changed (Gershman, 2018). In reinforcement learning, we tend to distinguish between model-free and model-based algorithms. The SR is believed to sit in between these two modalities. In model-free reinforcement learning, the aim is to directly learn the value of each state in the environment. Since there is no model of the environment at all, if the location of a reward is changed, the agent will have to first unlearn the previous reward location by visiting it enough times, and only then will it be able to re-learn the new location. In model-based reinforcement learning, a precise model of the environment is learned, specifically, single-step transition probabilities between all states of the environment. Model-based learning is computationally expensive, but allows a certain flexibility: if the reward changes location, the updated values of the states can be derived immediately. The SR can re-learn a new reward location fairly efficiently, although less so than model-based learning. The SR can also be efficiently learned using model-free methods and allows us to easily compute values for each state, which in turn can guide the policy (Dayan, 1993; Russek et al., 2017; Momennejad et al., 2017). This position between model-based and model-free methods makes the SR framework very powerful, and its similarities with hippocampal neuronal dynamics have led to increased attention from the neuroscience community. Finally, in our examples above we considered an environment made up of a discrete number of states. This framework can be generalised to a continuous environment represented by a discrete number of place cells.
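The flexibility described above can be illustrated with a minimal sketch in which state values are obtained as the product of the successor matrix and a reward vector. The four-state track, the reward vectors and the variable names are illustrative assumptions. In practice the new reward vector would itself have to be learned, which the sketch skips.

```python
import numpy as np

gamma = 0.9
n_states = 4

# Deterministic left-to-right transitions on a linear track (illustrative).
P = np.diag(np.ones(n_states - 1), k=1)
M = np.linalg.inv(np.eye(n_states) - gamma * P)  # successor matrix

# Reward at the end of the track, then moved to the second state.
reward_old = np.array([0.0, 0.0, 0.0, 1.0])
reward_new = np.array([0.0, 1.0, 0.0, 0.0])

# The same SR is reused; only the reward vector changes.
V_old = M @ reward_old
V_new = M @ reward_new
print(V_old, V_new)
```

The point of the sketch is that `M` is untouched when the reward moves, so only the reward vector needs updating, whereas a purely model-free learner would have to relearn every state value.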

Learning the successor representation in biologically plausible networks
We propose a model of the hippocampus that is able to learn the successor representation. We consider a feedforward network comprising two layers. Similar to McNaughton and Morris, 1987; Hasselmo and Schnell, 1994; Mehta et al., 2000; Hasselmo et al., 2002, we assume that the presynaptic layer represents the hippocampal CA3 region and is all-to-all connected to a postsynaptic layer representing the CA1 network (Figure 1c). The synaptic connections from CA3 to CA1 are plastic such that the weight changes follow a spike-timing-dependent plasticity (STDP) rule consisting of two terms: a weight-dependent depression term for presynaptic spikes and a potentiation term for pre-post spike pairs (Figure 1d). For simplicity, we assume that the animal spends a fixed time T in each state. During this time, a constant activation current is delivered to the CA3 neuron encoding the current location and, after a delay, to the corresponding CA1 place cell (see Materials and methods). On top of these fixed and location-dependent activations, the CA3 neurons can activate neurons in CA1 through the synaptic connections. In other words, the CA3 neurons are activated according to the current location of the animal, while the CA1 neurons have a similar location-dependent activity combined with activity caused by presynaptic neurons. The constant currents delivered directly to CA3 and CA1 neurons can be thought of as location-dependent currents from entorhinal cortex. These activations subsequently trigger plasticity at the synapses, and we can show analytically that, using the spike-timing-dependent plasticity rule discussed above, the SR is learned in the synaptic weights (Figure 1e and f, and see Appendix).
Moreover, we find that, on an algorithmic level, our weight updates are equivalent to a learning algorithm known as TD(λ), a powerful and well-known algorithm in reinforcement learning that can be used to learn the successor representation. TD(λ) is based on a mixed methodology, which is regulated by the parameter λ. At one extreme, when λ = 1, the SR is estimated by taking the average of state occupancies over past trajectories. This type of algorithm is called TD(1) or Monte Carlo (MC). At the other extreme, when λ = 0, the estimate of the SR is adjusted 'online', with every step of the trajectory, by comparing the observed position with its predicted value. This algorithm is equivalent to TD(0). For all values of λ in between, the algorithm employs a mixture of both methodologies. The extreme cases of TD(1) and TD(0) have different strengths and weaknesses, as we will discuss in more detail in the next sections.
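For readers less familiar with TD(λ), the following is a minimal tabular sketch of how the algorithm can learn the SR, using the textbook formulation with accumulating eligibility traces. It is not the spiking implementation of the article; the function name, the learning rate `alpha` and the treatment of the last state as terminal are assumptions made for illustration.

```python
import numpy as np

def td_lambda_sr(trajectories, n_states, gamma=0.9, lam=0.5, alpha=0.1):
    """Tabular TD(lambda) estimate of the SR with accumulating traces."""
    M = np.zeros((n_states, n_states))
    I = np.eye(n_states)
    for traj in trajectories:
        e = np.zeros(n_states)  # eligibility trace over previously visited states
        for t, s in enumerate(traj):
            e = gamma * lam * e
            e[s] += 1.0
            # Bootstrapped prediction at the next state; zero after the final state.
            nxt = M[traj[t + 1]] if t + 1 < len(traj) else 0.0
            delta = I[s] + gamma * nxt - M[s]  # vector-valued prediction error
            M += alpha * np.outer(e, delta)
    return M

track = [0, 1, 2, 3]  # deterministic left-to-right run
M_td0 = td_lambda_sr([track] * 500, 4, lam=0.0)  # fully bootstrapped
M_td1 = td_lambda_sr([track] * 500, 4, lam=1.0)  # Monte Carlo-like
print(np.round(M_td0, 3))
```

With λ = 0 the trace only holds the current state, so each update relies on the estimate at the next state. With λ = 1 the error at every step is credited to all earlier states of the trajectory. On this deterministic track both settings approach the same matrix.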
We prove analytically that the dynamics of our spiking neural network are mathematically equivalent to the TD(λ) algorithm (see Appendix). Our calculations essentially prove that, at each step, our neural network tracks the reinforcement learning algorithm, which is known to converge to the theoretical values of the SR. This equivalence guarantees that our neural network weights will eventually converge to the correct SR matrix. As a proof of principle, we show that it is possible to learn the SR for any initial weights (Figure 1-figure supplement 2), independently of any previous learning in the CA3 to CA1 connections.
Importantly, from our analytical derivations (see Appendix), we find that the λ parameter depends on the behavioral parameter T (the time an animal spends in a state). The larger the time T, the smaller the value of λ, and vice versa. In other words, when the animal moves through the trajectory on behavioral time-scales (large T compared to the synaptic plasticity time-scale τ_{LTP}), the network is learning the SR with TD(λ ∼ 0). For quick sequential activities (T → 0), akin to hippocampal replays, the network is learning the SR with TD(λ ∼ 1). As we will discuss below, this framework therefore combines learning based on rate coding as well as temporal coding. Furthermore, our model predicts that replays can also be used for learning purposes and that they are algorithmically equivalent to MC, whereas during behavior, the hippocampal learning algorithm is equivalent to TD(λ). This strategy of using replays to learn is in line with recent experimental and theoretical observations (see Momennejad, 2020 for a review).
Figure 1 (caption fragment). [...] depression term is dependent on the synaptic weight and presynaptic spikes (blue). The potentiation term depends on the timing between a pre- and post-synaptic spike pair (red), following an exponentially decaying plasticity window (bottom). (E-F) Schematics illustrating some of the results of our model. (E) Our spiking model learns the top row of the successor representation (panel B) in the weights between the first CA3 place cell and the CA1 cells. (F) Our spiking model learns the third row of the successor representation (panel B) in the weights between the third CA3 place cell and the CA1 cells.
To validate our analytical results, we again use a linear track with a deterministic policy. Using our spiking model with either rate-code activity on behavioral time-scales (Figure 2a top) or temporal-code activity similar to replays (Figure 2b top), we show that the synaptic weights across trials closely match the evolution of the TD(λ) algorithm (Figure 2a and b middle). While convergence to the SR is guaranteed (Figure 2a and b bottom) due to the mathematical equivalence between our setup and TD(λ) (Figure 2-figure supplement 1), the learning trajectory has more variance in the neural network case due to the noise introduced by the randomness of the spike times. This noise can be mitigated by averaging over a population of neurons. Moreover, due to the equivalence with TD(λ), our setup is general for any type of task where discrete states are visited, in any dimension, and which need not be a navigation task (see e.g. Figure 1-figure supplement 1 for a 2D environment).
In summary, we showed how the network can learn the SR using a spiking neural model. We analytically showed how the learning algorithm is equivalent to TD(λ), and confirmed this using numerical simulations. We derived a relationship between the abstract parameter λ and the timescale T representing the animal's behavior (and in turn the neuronal spiking), allowing us to unify rate and temporal coding within one framework. Furthermore, we predict a role for hippocampal replays in learning the SR using an algorithm equivalent to Monte Carlo.

Learning over behavioral time-scales using STDP
An important observation in our framework is that the SR can be learned using the same underlying STDP rule over time-scales ranging from replays up to behavior. One can now wonder how it is possible to learn relationships between events that are seconds apart during awake behavior, without any explicit error encoding signal typically used by the TD algorithm, and while the STDP rule is characterised by millisecond time-scales (Figure 3a).
From a neuroscience perspective, this can be understood when considering the trajectory of the animal. Each time the animal moves from a position S_{j−1} to a position S_j, the CA3 cell encoding the location S_{j−1} stops firing and the CA3 cell encoding the location S_j starts firing. Since in our example this transition is instantaneous, these cells are activating the same CA1 cells consecutively. Therefore, the change in the weight w_{i,j−1} depends on the synaptic weight of the subsequent state w_{i,j} (Figure 3b, yellow depends on orange, orange depends on red, etc.). Indeed, in our example of an animal in a linear track subdivided into four locations, the weights on the diagonal, such as w_{4,4}, are the first ones to be learned, since they are learned directly. The off-diagonal weights, such as w_{3,4}, w_{2,4}, and w_{1,4}, are learned progressively more slowly, as each depends on the subsequent synaptic weight. Eventually, weights between neurons encoding positions that are behaviorally far apart can be learned using a learning rule on a synaptic timescale (Figure 3b).
Figure 3 (caption fragment). [...] (iii) the CA3 firing rate in state 3 is doubled (green and panel F). Panels E and F lead to a modified discount parameter in state 3, affecting the receptive fields of place cells 3 and 4.
From a reinforcement learning perspective, the TD(0) algorithm relies on a property called bootstrapping. This means that the successor representation is learned by first taking an initial estimate of the SR matrix (i.e. the previously learned weights), and then gradually adjusting this estimate (i.e. the synaptic weights) by comparing it to the states in the environment that the animal actually visits. This comparison is achieved by calculating a prediction error, similar to the widely studied one for dopamine neurons (Schultz et al., 1997). Since the synaptic connections carry information about the expected trajectories, in this case, the prediction error is computed between the predicted and observed trajectories (see Materials and methods).
The main point of bootstrapping, therefore, is that learning happens by adjusting our current predictions (e.g. synaptic weights) to match the observed current state. This information is available at each time step and thus allows learning over long timescales using synaptic plasticity alone. If the animal moves to a state in the environment that the current weights deem unlikely, potentiation will prevail and the weight from the previous to the current state will increase. Otherwise, the opposite will happen. It is important to notice that the prediction error in our model is not encoded by a separate mechanism in the way that dopamine is thought to encode reward prediction errors (Schultz et al., 1997). Instead, the prediction error is represented locally, at the level of the synapse, through the depression and potentiation terms of our STDP rule, and the current weight encodes the current estimate of the SR (see Materials and methods). Notably, the prediction error is equivalent to the TD(λ) update. This mathematical equivalence ensures that the weights of our neural network track the TD(λ) update at each state, and thus ensures stability and convergence to the theoretical values of the SR. We therefore do not need an external vector to carry prediction error signals as proposed in Gardner et al., 2018; Gershman, 2018. In fact, the synaptic potentiation in our model updates a row of the SR, while the synaptic depression updates a column.
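The way bootstrapping lets information travel backward along the track, one state per traversal, can be seen in a small tabular TD(0) sketch. This is an abstract illustration and not the spiking network; the learning rate, the zero initialization and the terminal last state are assumptions.

```python
import numpy as np

def td0_episode(M, traj, gamma, alpha):
    """One pass of tabular TD(0) for the SR along a trajectory (updates M in place)."""
    I = np.eye(M.shape[0])
    for t, s in enumerate(traj):
        nxt = M[traj[t + 1]] if t + 1 < len(traj) else 0.0
        M[s] += alpha * (I[s] + gamma * nxt - M[s])
    return M

gamma, alpha = 0.9, 0.1
M = np.zeros((4, 4))
# Trial at which each entry of the last column first becomes non-zero.
first_nonzero = np.zeros(4, dtype=int)
for trial in range(1, 21):
    td0_episode(M, [0, 1, 2, 3], gamma, alpha)
    newly = (M[:, 3] > 0) & (first_nonzero == 0)
    first_nonzero[newly] = trial

# Fraction of the asymptotic value reached by each entry of the last column.
progress = M[:, 3] / gamma ** np.arange(3, -1, -1)
print(first_nonzero, np.round(progress, 3))
```

In this sketch the diagonal entry for the last state starts changing on the first trial, while the entry linking the first and last state only starts changing once the intermediate entries have become non-zero, mirroring the dependence of each weight on the subsequent one.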
At the other extreme, for very fast timescales such as replays, TD(1) is equivalent to online Monte Carlo learning (MC), which does not bootstrap at all. Instead, MC samples the whole trajectory and then simply takes the average of the discounted state occupancies to update the SR (see Materials and methods). During replays, the whole trajectory falls under the plasticity window and the network can learn without bootstrapping. For all cases in between, the network partially relies on bootstrapping and we correspondingly find a λ between 0 and 1.
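A Monte Carlo update of the SR, as described above, can be sketched as follows. The sketch uses an every-visit update with a constant step size `alpha` rather than an explicit running average, and all names are illustrative assumptions.

```python
import numpy as np

def discounted_occupancy(traj, n_states, gamma):
    """G[t] is the discounted state occupancy observed from step t to the end."""
    G = np.zeros((len(traj), n_states))
    running = np.zeros(n_states)
    for t in range(len(traj) - 1, -1, -1):  # accumulate backward in time
        running = gamma * running
        running[traj[t]] += 1.0
        G[t] = running
    return G

def mc_sr_update(M, traj, gamma, alpha):
    """Move each visited row of M toward the occupancy actually observed."""
    G = discounted_occupancy(traj, M.shape[0], gamma)
    for t, s in enumerate(traj):
        M[s] += alpha * (G[t] - M[s])
    return M

# With alpha = 1 a single deterministic run already yields the full SR.
M_mc = mc_sr_update(np.zeros((4, 4)), [0, 1, 2, 3], gamma=0.9, alpha=1.0)
print(np.round(M_mc, 3))
```

Because the target is the complete observed trajectory, no estimate at a later state is needed, which is the sense in which Monte Carlo learning does not bootstrap.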
In summary, in our framework, synaptic plasticity leads to the development of a successor representation in which synaptic weights can be directly linked to the successor matrix. In this framework, we can learn over behavioral timescales even though our plasticity rule acts on the scale of milliseconds, due to the bootstrapping property of TD algorithms.

Different discounting for space and time
In reinforcement learning, it is usual to have delay-discounting: rewards that are further away in the future are discounted compared to rewards that are in the immediate future. Intuitively, it is indeed clear that a state leading to a quick reward can be regarded as more valuable compared to a state that only leads to an equal reward in the distant future. For tasks in a tabular setting, with a discrete state space and where actions are taken in discrete turns, such as chess or our simple linear track discussed in section 'The successor representation', one can simply use a multiplicative factor 0 < γ ≤ 1 for each state transition. In this case the discount follows an exponential dependence, where rewards that are n steps away are discounted by a factor of γ^n.
In order to still use the above exponential discount when time is continuous, the usual approach is to discretize time by choosing a unit of time. However, this would imply one can never remain in a state for a fraction of this unit of time, and it is not clear how this unit would be chosen. Our framework deals naturally with continuous time, through the monotonically decreasing dependence of the discount parameter γ on the time an agent remains in a state, T. The dependence on T can be interpreted as an increased discounting the longer a state lasts.
In this way, instead of discounting by γ^n when the agent stays n units of time in a certain state, we would discount by γ(n · T). More generally, for any arbitrary time T, a discount corresponding to γ(T) will be applied. This allows the agent to act in continuous time (Figure 3c and e). Interestingly, the dependence of γ on T in our model is not exponential as in the tabular case. Instead, we have a hyperbolic dependence. This hyperbolic discount is well studied in psychology and neuroeconomics and appears to agree well with experimental results (Laibson, 1997; Ainslie, 2012).
The difference between a hyperbolic discount and an exponential discount lies in the fact that we will attribute a different value to the same temporal delay, depending on whether it happens sooner or later. A classic example is that, when given the choice, people tend to prefer 100 dollars today instead of 101 dollars tomorrow, while they tend to prefer 101 dollars in 31 days instead of 100 dollars in 30 days. They therefore judge the 1 day of delay differently when it happens later in time. Exponential discounting, on the other hand, always attributes the same value to the same delay no matter when it occurs.
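The preference reversal in the example above can be reproduced with a few lines of code. The sketch uses the generic hyperbolic form 1 / (1 + kT) common in the psychology literature, which is an assumption and not the expression derived in the Appendix. The values of `k` and `gamma` are hypothetical and chosen only so that the reversal appears.

```python
def hyperbolic(delay_days, k=0.012):
    """Generic hyperbolic discount; k (per day) is a hypothetical value."""
    return 1.0 / (1.0 + k * delay_days)

def exponential(delay_days, gamma=0.98):
    """Exponential discount with a hypothetical per-day factor gamma."""
    return gamma ** delay_days

def prefers_later(discount, amount_soon, delay_soon, amount_late, delay_late):
    """True if the later, larger amount has the higher discounted value."""
    return amount_late * discount(delay_late) > amount_soon * discount(delay_soon)

for name, discount in [("hyperbolic", hyperbolic), ("exponential", exponential)]:
    today_vs_tomorrow = prefers_later(discount, 100, 0, 101, 1)
    day30_vs_day31 = prefers_later(discount, 100, 30, 101, 31)
    print(name, today_vs_tomorrow, day30_vs_day31)
```

With these parameter choices the hyperbolic chooser takes 100 dollars today but waits for 101 dollars on day 31, whereas the exponential chooser makes the same choice in both situations, because only the ratio of the two discount factors matters and that ratio is constant.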
Our model therefore combines two types of discounting: exponential when we move through space (when sequentially activating different place cells) and hyperbolic when we move through time (when we prolong the activity of the same place cell).
The discount factor γ also depends on other parameters such as firing rate and STDP amplitudes (see Equation 22 in the Appendix). This gives our model the flexibility to encode state-dependent discounting even when the trajectories and times spent in the states are the same. Such state-dependent discounting can be useful, for example, to encode salient locations in the environment such as landmarks or reward locations (Figure 3c and f).

Bias-variance trade-off
As discussed previously (section 'Learning the successor representation in biologically plausible networks'), the TD(λ) algorithm unifies the TD algorithm and the MC algorithm. In our framework, replay-like neuronal activations are equivalent to MC, while behavioral-like activity is equivalent to TD. In this section, we will discuss how the replays and behavior can work together when learning the cognitive map of an environment, leveraging the strengths of MC and TD.
The MC algorithm effectively works by averaging over the sampled trajectories. As such, the estimated SR matrix will be a close approximation of the theoretical value. The difference between the estimated and theoretical value is commonly referred to as bias. We can therefore say that the MC algorithm presents low bias. However, if the agent moves in the environment at random, the sampled trajectories will be quite different from each other. When taking the average, the estimated value will therefore fluctuate a lot. In this case, we say that the MC estimate has high variance (Figure 4A and B).
Unlike MC, the TD algorithm updates its estimate of the SR by comparing the current estimate of the SR with the actual state the agent transitioned to. Because of the dependence on the current estimate, this estimate will be incrementally refined with small updates. In this way, the SR estimate will not fluctuate much, and will be lower in variance. However, by this dependence on the current estimate, we introduce a bias in the algorithm, which will be especially significant when our initial estimate of the SR is bad (Figure 4A and B). The TD algorithm therefore presents high bias and low variance.
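The bias and variance of the two kinds of learning target can be compared directly in a small simulation of a random walk. The seven-state track, the discount value and the number of sampled episodes are hypothetical choices, and the sketch looks only at the targets seen from the start state rather than at a full learning run.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, start, gamma = 7, 3, 0.9  # hypothetical track; states 0 and 6 are terminal

# Ground-truth SR of the 50/50 random walk (terminal rows have no transitions).
P = np.zeros((n_states, n_states))
for s in range(1, n_states - 1):
    P[s, s - 1] = P[s, s + 1] = 0.5
M_true = np.linalg.inv(np.eye(n_states) - gamma * P)

def sample_episode():
    traj = [start]
    while 0 < traj[-1] < n_states - 1:
        traj.append(traj[-1] + rng.choice([-1, 1]))
    return traj

def mc_target(traj):
    """Discounted occupancy actually observed from the start state."""
    g = np.zeros(n_states)
    for t, s in enumerate(traj):
        g[s] += gamma ** t
    return g

def td_target(traj, M_est):
    """One-step target: current state plus discounted estimate at the next state."""
    return np.eye(n_states)[traj[0]] + gamma * M_est[traj[1]]

episodes = [sample_episode() for _ in range(5000)]
mc = np.array([mc_target(tr) for tr in episodes])
td_good = np.array([td_target(tr, M_true) for tr in episodes])
td_naive = np.array([td_target(tr, np.zeros_like(M_true)) for tr in episodes])

bias_mc = np.linalg.norm(mc.mean(axis=0) - M_true[start])
bias_td_naive = np.linalg.norm(td_naive.mean(axis=0) - M_true[start])
var_mc = mc.var(axis=0).sum()
var_td = td_good.var(axis=0).sum()
print(bias_mc, bias_td_naive, var_mc, var_td)
```

The Monte Carlo target averages to the true SR row regardless of the current estimate, but it varies strongly between episodes. The bootstrapped target varies less, yet it inherits whatever error the current estimate contains, which is most visible for the all-zero initial estimate.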
We now apply these concepts to learning in a novel environment. Since the MC algorithm is unbiased by the initial estimate of the SR, replays should initially speed up learning in an unfamiliar environment. Later on, when the environment becomes familiar, the SR estimate is already closer to the exact value. At this point, low variance matters more, and thus the TD algorithm is preferable. We confirm this logic using our spiking neural networks, and show that we can have both quick learning and low error at convergence if we have proportionally more replays during the first trials in a novel environment (Figure 4a-e). In contrast, with an equal proportion of replays throughout the whole simulation, learning is not as quick as with MC and the asymptotic error is not as low as with TD (Figure 4-figure supplement 1). Interestingly, the pattern of proportionally more replays in novel environments versus familiar environments has also been experimentally observed (Cheng and Frank, 2008; Figure 4f). Please note that, while we implemented an exponentially decaying probability for replays after entering a novel environment, different schemes for replay activity could be investigated. Note also that other mechanisms besides the successor representation could account for these results, including model-based reinforcement learning.
Figure 4 (caption fragment). [...] The agent follows a stochastic policy starting from the initial state (denoted by START). The probability to move to either neighboring state is 50%. An epoch stops when reaching a terminal state (denoted with STOP). (B) Root mean squared error (RMSE) between the learned SR estimate and the theoretical SR matrix. The full lines are mean RMSEs over 1000 random seeds. Three cases are considered: (i) learning happens exclusively due to behavioral activity (TD STDP, green); (ii) learning happens exclusively due to replay activity (MC STDP, purple); (iii) a mixture of behavioral and replay learning, where the probability of replays drops off exponentially with epochs (Mix STDP, pink). The mix model, with a decaying number of replays, learns as quickly as MC in the first epochs and converges to a low error similar to TD, benefiting both from the low bias of MC at the start and the low variance of TD at the end. (C, D, E) Representative weight changes for each of the scenarios. Full lines show various random seeds, shaded areas denote one standard deviation over 1000 random seeds. (F) More replays are observed when an animal explores a novel environment (day 1). Panel F adapted from Figure 3A in Cheng and Frank, 2008.

Leveraging replays to learn novel trajectories
In the previous section, the replays re-activated the same trajectories as seen during behavior. In this section, we extend this idea and show how in our model replays can be useful during learning even when the re-activated trajectories were not directly experienced during behavior.
For this purpose, we reproduce a place-avoidance experiment from Wu et al., 2017. In short, rats are allowed to freely explore a linear track on day 1. Half of the track is dark, while the other half is bright. On day 2, the animals performed four trials separated by resting periods: in the first trial (pre), the animals were free to explore the track; in the second trial (shock), they started in the light zone and received two mild footshocks on entering the shock zone; in the third and fourth trials (post and re-exposure, respectively), they were allowed to freely explore the track again, but starting from the light zone or the shock zone, respectively (Figure 5a). In the study, it was reported that during the post trial, animals tended to stay in the light zone, and forward replays from the current position to the shock zone were observed when the animals reached the boundary between the light and the dark zone (Figure 5b and c).
We simulated a simplified version of this task. Our simulated agent moves through the linear track following a softmax policy, and all states have equal value during the first phase (pre) (Figure 5d, blue trajectories). Then, the agent is allowed to move through the linear track until it reaches the shock zone and experiences a negative reward. Finally, the third phase is similar to the first phase and the agent is free to explore the track. Two versions of this third phase were simulated. In one version, there are no replays (Figure 5d, orange trajectories in left panel), while in the second version a forward replay until the shock zone is simulated every time the agent enters the middle state (Figure 5d, orange trajectories in right panel, replays not shown). The replays affect the learning of the successor representation, and the negative reward information is propagated towards the decision point in the middle of the track. The states in the dark zone therefore have lower value compared to the case without replays (Figure 5e). In turn, this difference in value affects the policy of the agent, which now tends to avoid the dark zone altogether, while the agent without replays still occupies states of the dark zone as much as states of the light zone (Figure 5f). Moreover, even when doubling the number of SR updates in the scenario without replays, the behavior of the agent remains unaltered (Figure 5-figure supplement 1). This shows that it is not the number of updates, but the type of policy that is important when updating the SR, and that using a different policy in the replay activity can significantly alter behavior.
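How lower state values translate into avoidance under a softmax policy can be sketched as follows. The exact policy used in the simulations is specified in Materials and methods and is not reproduced here. The sketch assumes that the agent chooses between its two neighbouring states according to their values, with a hypothetical inverse temperature `beta` and made-up value vectors that are not results from the article.

```python
import numpy as np

def softmax_policy(values, state, beta=1.0):
    """Probability of stepping to each neighbour of `state` on a linear track."""
    n = len(values)
    neighbours = [s for s in (state - 1, state + 1) if 0 <= s < n]
    prefs = beta * np.array([values[s] for s in neighbours], dtype=float)
    prefs -= prefs.max()  # subtract the maximum for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return dict(zip(neighbours, probs))

# Illustrative values only: equal values, then lower values toward one end.
flat_values = np.zeros(5)
avoidant_values = np.array([0.0, 0.0, 0.0, -0.5, -1.0])

p_flat = softmax_policy(flat_values, state=2, beta=2.0)
p_avoid = softmax_policy(avoidant_values, state=2, beta=2.0)
print(p_flat, p_avoid)
```

With equal values the agent steps left or right with equal probability. Once the values on one side are lowered, as happens in the article when replays propagate the negative reward toward the middle state, the probability of stepping toward that side drops.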
Our setup for this simulation is simplified, and does not aim to reproduce the complex decision making of the rats. Observe for example the peak of occupancy of the middle state by the animals (Figure 5c), which is not captured by our model because we assume the agent to spend the same amount of time in each state. Nonetheless, it is interesting to see how replaying trajectories that were not directly experienced before, in combination with a model allowing replays to affect the learning of a cognitive map, can substantially influence the final policy of an agent and the overall performance. This mental imagination of trajectories could be exploited to refine our cognitive maps, avoiding unfavourable locations or finding shortcuts to rewards. It is important to note here that, while we are suggesting a potential role for the SR in solving this task, the data itself would also be compatible with a model-based strategy. In fact, experimental evidence suggests that humans may use a mixed strategy involving both model-based reinforcement learning and the successor representation.

Discussion
In this article, we investigated how a spiking neural network model of the hippocampus can learn the successor representation. Interestingly, we show that the updates in synaptic weights resulting from our biologically plausible STDP rule are equivalent to TD(λ) updates, a well-known and powerful reinforcement learning algorithm.

Reinforcement learning
Our network learns the SR in the CA3-CA1 weights. Since we have modeled neurons to integrate the synaptic EPSPs and generate spikes using an inhomogeneous Poisson process based on the depolarization, the firing rate is proportional to the total synaptic weights. Therefore, the successor representation can be read out simply by a downstream neuron. Moreover, since the value of a state is defined by the inner product between the successor matrix and the reward vector, it is sufficient for the synaptic weights to the downstream neuron to learn the reward vector, and the downstream neuron will then encode the state value in its firing rate (see Figure 5-figure supplement 2). While the neuron model used is simple, it will be interesting for future work to study analogous models with non-linear neurons.
It is worth noting that, during learning, both pre-synaptic and post-synaptic layers receive external inputs representing the current state (Equation 10 and Equation 11 in Materials and methods). This may induce a distortion in the read out of the diagonal elements of the SR matrix (see Equations 13 and 15, and Figure 5-figure supplement 2). At first glance, this may indicate that learning and reading out are antagonistic. However, there are multiple ways we could resolve this apparent conflict: (i) Since the external current in CA1 is present for only a fraction of the time T in each state, the readout might happen during the period of CA3 activation exclusively; (ii) The readout may be over the whole time T but becomes more noisy towards the end. Even in the case where the readout is noisy, the distortion would be limited to the diagonal elements of the matrix; (iii) Learning and readout may be separate mechanisms, where the CA3 driving current is present during readout only. This could be for instance signaled by neuromodulation (e.g. noradrenaline and acetylcholine are active during learning but not exploration; Micheau and Marighetto, 2011; Hasselmo and Sarter, 2011; Robbins, 1997; Teles-Grilo Ruivo and Mellor, 2013; Palacios-Filardo et al., 2021), or it could be that readout happens during replays; (iv) The weights to or activation functions of the readout neuron may learn to compensate for the distorted signal in CA1.
Furthermore, we can notice that the external inputs encoding the current state activate CA3 first, and CA1 later. The delay between these activations θ/T (Equation 10 and Equation 11 in Materials and methods) is an arbitrary parameter that can be adjusted. Varying this delay will change the reinforcement learning representation, especially parameters λ and γ , but also the strength of the input current (see Figure 5-figure supplement 3). However, this will not impact the distortion of the diagonal elements of the SR matrix, which remains similar across various delay values θ/T (see Figure 5figure supplement 4).

Biological plausibility
Uncovering a connection between STDP and TD(λ) shows how, using minimal assumptions, a theoretically grounded learning algorithm can emerge from a biological implementation of plasticity. Similar learning rules have indeed been observed in the hippocampus (Shouval et al., 2002) and proposed on theoretical grounds (Mehta et al., 2000; Waddington et al., 2012; van Rossum et al., 2012).
The TD algorithm is most commonly known in neuroscience for describing how reward prediction can be computed in the brain. More specifically, it is widely believed that dopamine neurons in the ventral tegmental area (VTA) and substantia nigra (SNc) encode the prediction error between the observed and expected reward (Schultz et al., 1997). Dopamine thus acts as a global signal that can be broadcast to other areas of the brain, like the striatum, to compute the expected reward. In our model, the TD algorithm estimates the SR (i.e. expected future occupancy), rather than the value. However, since the prediction error for the SR is different for every synaptic connection (i.e. each pair of states), it is not clear how it could be carried by a global signal analogous to dopamine. The SR would need multiple signals, or a matrix transformation of the global signal. Furthermore, we would need to postulate that such an error, or errors, are computed elsewhere in the brain. Instead, in our model, the prediction error simply emerges from the synaptic plasticity rule itself. Moreover, thanks to the presynaptic depression, our STDP rule alone allows us to compute negative prediction errors, which still pose an open challenge for computation with dopamine because of the low baseline dopaminergic firing rate (Glimcher, 2011; Daw et al., 2002; Matsumoto and Hikosaka, 2007).
Our framework smoothly connects a temporally precise spiking code with a fully rate-based code, and anything in between. As we have proven mathematically, this translates into moving smoothly from Monte Carlo to temporal difference learning by means of TD(λ). Fast spiking sequences (temporal code) can be used for consolidation of previous experiences using Monte Carlo learning, while the behavioral timescale activity (rate code) results in TD updates, allowing learning on the timescale of seconds even with plasticity time constants on the order of milliseconds. This type of Hebbian learning over behavioral timescales exploits the bootstrapping property of TD, and is different from the one-shot behavioral plasticity described in Bittner et al., 2017. However, these two mechanisms could be complementary: the latter could play a more significant role in the formation of new place fields, while the former would be more relevant for shaping existing place fields to contain predictive information. Learning on behavioral timescales using STDP was also investigated in Drew and Abbott, 2006. The main difference between Drew and Abbott, 2006 and our work is that the former relies on overlapping neural activity between the pre- and post-synaptic neurons from the start, while in our case no such overlap is required. In other words, our setup allows us to learn connections between a presynaptic neuron and a postsynaptic neuron whose activities are initially separated by behavioral timescales. For this to be possible, there are two requirements: (1) the task needs to be repeated many times and (2) a chain of neurons is consecutively activated between the aforementioned presynaptic and postsynaptic neurons. Due to this chain of neurons, over time the activity of the postsynaptic neuron will start earlier, eventually overlapping with that of the presynaptic neuron.
In our work, we did not include theta modulation, but phase precession and theta sequences could be yet another type of activity within the TD(λ) framework. A recent work (George et al., 2023) incorporated theta sweeps into behavioral activity, showing that it approximately learns the SR. Moreover, theta sequences allow for fast learning, playing a similar role to replays (or any other fast temporal-code sequences) in our work. By simulating the temporally compressed and precise theta sequences, their model also reconciles learning over behavioral timescales with STDP. In contrast, our framework reconciles both timescales relying purely on rate coding during behavior. Finally, their method allows the SR to be learned in continuous space. It would be interesting to investigate whether these methods co-exist in the hippocampus and other brain areas. Furthermore, Fang et al., 2023 recently showed how the SR can be learned using recurrent neural networks with biologically plausible plasticity.
There are three different neural activities in our proposed framework: the presynaptic layer (CA3), the postsynaptic layer (CA1), and the external inputs. These external inputs could for example be location-dependent currents from the entorhinal cortex, with timings guided by the theta oscillations. The dependence of CA1 place fields on CA3 and entorhinal input is in line with lesion studies (see e.g. Brun et al., 2008;Hales et al., 2014;O'Reilly et al., 2014). It would be interesting for future studies to further dissect the role various areas play in learning cognitive maps.
Notably, even though we have focused on the hippocampus in our work, the SR does not require predictive information to come from higher-level feedback inputs. This framework could therefore be useful even in sensory areas: certain stimuli are usually followed by other stimuli, essentially creating a sequence of states whose temporal structure can be encoded in the network using our framework. Interestingly, replays have been observed in other brain areas besides the hippocampus (Kurth-Nelson et al., 2016;Staresina et al., 2013). Furthermore, temporal difference learning in itself has been proposed in the past as a way to implement prospective coding (Brea et al., 2016).

Replays
We have also proposed a role for replays in learning the SR, in line with experimental findings and RL theories (Momennejad et al., 2017). In general, replays are thought to serve different functions, spanning from consolidation to planning (Roscow et al., 2021). Here, we have shown that when the replayed trajectories are similar to the ones observed during behavior, they play the role of speeding up and consolidating learning by regulating the bias-variance trade-off, which is especially useful in novel environments. On the other hand, if the replayed trajectories differ from the ones experienced during wakefulness, replays can play a role in reshaping the representation of space, which would suggest their involvement in planning. Experimentally, it has been observed that replays often start and end at relevant locations in the environment, like reward sites, decision points, obstacles or the current position of the animal (Ólafsdóttir et al., 2015; Pfeiffer and Foster, 2013; Jackson et al., 2006; Mattar and Daw, 2017). Since these are salient locations, this is in line with our proposition that replays can be used to maintain a convenient representation of the environment. It is worth noting that replays can serve a variety of functions, and our framework merely proposes additional beneficial properties without claiming to explain all observed replays. For example, in addition to forward replays, reverse replays are also ubiquitous in the hippocampus (Pfeiffer, 2020). Reverse replays are not included in our framework, and it is not yet clear whether they play different roles, with some evidence suggesting that reverse replays are more closely tied to reward encoding (Ambrose et al., 2016). Moreover, while indirect evidence supports the idea that replays can play a role during learning (Igata et al., 2021), it is not yet clear how synaptic plasticity is manifested during replays (Fuchsberger and Paulsen, 2022).

Learning flexibility
Multiple ideas from reinforcement learning, such as TD(λ), state-dependent discounting and the successor representation, emerge quite naturally from our simple biologically plausible setting. We propose in our work that time and space can be discounted differently. The flexibility to change the discount factor by modulating firing rates and plasticity parameters, a form of modulation that is ubiquitous in neural circuits, suggests that these mechanisms could be used to encode a variety of information in a cognitive map. Moreover, the specific dependence of the discount factor on the biological parameters leads to experimentally testable predictions. Indeed, our framework predicts well-defined changes in place fields after modulations of firing rates, speed of the agent or neuromodulation of the plasticity parameters (Figure 3). Importantly, the discount parameter also depends on the time spent in each state. This eliminates the need for time discretization, which does not reflect the continuous nature of the response of time cells (Kraus et al., 2013).

Limitations of the reinforcement learning framework
We have already outlined some of the benefits of using reinforcement learning for modeling behavior, including providing clear computational and algorithmic frameworks. However, there are several intrinsic limitations to this framework. For example, RL agents that only use spatial data do not provide complete descriptions of behavior, which likely arises from integrating information across multiple sensory inputs. Whereas an animal would be able to smell and see a reward from a certain distance, an agent exploring the environment would only be able to discover it when randomly visiting the exact reward location. Furthermore, the framework rests on fairly strict mathematical assumptions: typically the state space needs to be Markovian, time and space need to be discretized (which we manage to evade in this particular framework) and the discounting needs to follow an exponential decay. These assumptions are simplistic and it is not clear how often they are actually met. Reinforcement learning is also a sample-intensive technique, whereas we know that some animals, including humans, are capable of much faster or even one-shot learning.
Even though we have provided a neural implementation of the SR, and of the value function as its read-out (see Figure 5-figure supplement 2), the whole action selection process is still computed only at the algorithmic level. It may be interesting to extend the neural implementation to the policy selection mechanism in the future.
Taken together, our work joins, in a single framework, a variety of concepts, from the neuronal level through cognitive theories to reinforcement learning.

Materials and methods
The successor representation
In a tabular environment, we define the value of a state s as the expected cumulative reward that an agent will receive when following a certain policy starting in s. The future rewards are multiplied by a factor $0 < \gamma^n \le 1$, where n is the number of steps until reaching the reward location and $0 < \gamma \le 1$ is the delay discount factor. It is usual to use $0 < \gamma < 1$, which ensures that earlier rewards are given more importance than later rewards. Formally, the value of a state s under a certain policy π is defined as
Here, a denotes the action, $R(s, a)$ is the reward function and $P(s'|s, a)$ is the transition function, i.e. the probability that taking an action a in state s will result in a transition to state s′. Following Dayan, 1993, we can decompose the value function into the inner product of the reward function and the successor matrix
This representation is known as the successor representation (SR), where each element $M_{ij}$ represents the expected future occupancy of state j when in state i. By decomposing the value into the SR and the reward function (Equation 3), relearning the state values V after changing the reward function is fast, similar to model-based learning. At the same time, the SR can be learned in a model-free manner, using for example temporal difference (TD) learning.

Derivation of the TD(λ) update for the SR
The TD(λ) update for the SR is then implemented according to (see e.g. Sutton and Barto, 1998)

$\Delta M(j, i) = \delta^{TD}_0 + \gamma\lambda\,\delta^{TD}_1 + (\gamma\lambda)^2\,\delta^{TD}_2 + \dots$

Using $\delta^{TD}_i$ for the TD error at step i and $\delta_{xy}$ for the Kronecker delta, corresponds to the TD error for element M(j + n, i + n) of the successor representation after the transition from state j + n to state j + n + 1. Combining Equations 5 and 6, we find and
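The TD(λ) update for the SR can be illustrated with a short simulation. The sketch below is a generic tabular, forward-view implementation, not the spiking network of this paper: it weights the TD errors of later steps by powers of γλ, as in the sum above, and applies the update once per traversal. The TD error used in the code (one-hot occupancy plus discounted next row minus current row, the standard form for the SR), the four-state track, and the values of γ, λ and η are assumptions chosen for illustration; the function names are hypothetical.

```python
import numpy as np

def td_lambda_sr_episode(M, traj, gamma, lam, eta):
    """One offline, forward-view TD(lambda) sweep over a trajectory of state indices."""
    n_states = M.shape[0]
    n_steps = len(traj)
    # TD error vector at every step k (assumed standard SR form):
    # delta_k = onehot(s_k) + gamma * M[s_{k+1}] - M[s_k], with a zero row after the last state.
    deltas = np.zeros((n_steps, n_states))
    for k, s in enumerate(traj):
        next_row = M[traj[k + 1]] if k + 1 < n_steps else np.zeros(n_states)
        deltas[k] = np.eye(n_states)[s] + gamma * next_row - M[s]
    M_new = M.copy()
    for t, s in enumerate(traj):
        # Weights 1, (gamma*lam), (gamma*lam)^2, ... on the current and later TD errors.
        weights = (gamma * lam) ** np.arange(n_steps - t)
        M_new[s] += eta * weights @ deltas[t:]
    return M_new

def learn_sr(n_states=4, gamma=0.9, lam=0.5, eta=0.1, n_episodes=500):
    """Repeatedly traverse a left-to-right track, starting from an identity SR."""
    M = np.eye(n_states)
    traj = list(range(n_states))
    for _ in range(n_episodes):
        M = td_lambda_sr_episode(M, traj, gamma, lam, eta)
    return M

M_learned = learn_sr()
print(np.round(M_learned, 3))
```

Setting `lam=1.0` makes each update depend only on the observed discounted occupancies (Monte Carlo), while `lam=0.0` bootstraps entirely from the current estimate of the next state; both converge to the same matrix on this deterministic track.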

Neural network model

Plasticity rule
The synaptic plasticity rule (Figure 1d) consists of a weight-dependent depression for presynaptic spikes and a spike-timing dependent potentiation, given by
Here, $w_{ij}$ represents the synaptic connection from presynaptic neuron j to postsynaptic neuron i, $Tr^{LTP}_j$ is the plasticity trace, a low-pass filter of the presynaptic spike train with time constant $\tau_{LTP}$, $t_j$ and $t_i$ are the spike times of the presynaptic and postsynaptic neuron respectively, $A_{LTP}$ and $A_{LTD}$ are the amplitudes of potentiation and depression respectively, $\eta_{STDP}$ is the learning rate for STDP and $\delta(\cdot)$ denotes the Dirac delta function.
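The verbal description of the rule can be turned into a small event-driven sketch for a single synapse. This is an illustration under stated assumptions rather than the exact rule used in the simulations: the trace is assumed to jump by one unit at each presynaptic spike, depression is assumed to subtract η·A_LTD·w at each presynaptic spike, and potentiation is assumed to add η·A_LTP·Tr at each postsynaptic spike. The function name and all parameter values are hypothetical.

```python
import math

def run_stdp(pre_spikes, post_spikes, w0, eta, a_ltp, a_ltd, tau_ltp):
    """Event-driven update of one synapse (presynaptic j -> postsynaptic i)."""
    # Merge both spike trains into one time-ordered event list.
    # At identical times "post" sorts before "pre", so a coincident pair does not potentiate.
    events = sorted([(t, "pre") for t in pre_spikes] + [(t, "post") for t in post_spikes])
    w, trace, t_last = w0, 0.0, 0.0
    for t, kind in events:
        # Exact exponential decay of the LTP trace since the previous event.
        trace *= math.exp(-(t - t_last) / tau_ltp)
        t_last = t
        if kind == "pre":
            w -= eta * a_ltd * w      # weight-dependent depression (assumed sign convention)
            trace += 1.0              # assumed unit jump of the presynaptic trace
        else:
            w += eta * a_ltp * trace  # potentiation proportional to the presynaptic trace
    return w

# Pre-before-post pairing: depression at t = 0 ms, potentiation at t = 10 ms.
w_causal = run_stdp([0.0], [10.0], w0=0.5, eta=0.1, a_ltp=1.0, a_ltd=1.0, tau_ltp=60.0)
# Post-before-pre pairing: the trace is still zero at the postsynaptic spike, so only depression acts.
w_acausal = run_stdp([10.0], [0.0], w0=0.5, eta=0.1, a_ltp=1.0, a_ltd=1.0, tau_ltp=60.0)
print(w_causal, w_acausal)
```

The asymmetry between the two pairings reflects the structure described above: potentiation requires a presynaptic spike to precede the postsynaptic one, whereas depression depends only on presynaptic activity and the current weight.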

Place cell activation
We assume that each state in the environment is represented by a population of place cells in the network. In our model, this is achieved by delivering place-tuned currents to the neurons. Whenever a state S = j is entered, the presynaptic neurons encoding state j start firing at a constant rate ρ pre for a time θ , following a Poisson process with parameter ρ pre h (t) . The other presynaptic neurons are assumed to be silent: where the Kronecker delta function is defined as δ hj = 1 if h = j and zero otherwise. Here we use the index j to denote any neuron belonging to the population of neurons encoding state j . After a short delay, at time t * , a similar current ρ bias is delivered to the postsynaptic neuron encoding state j , for a duration of time ω .
Besides the place-tuned input current, CA1 neurons receive inputs from the presynaptic layer (CA3). The postsynaptic potential $\rho^{post}_i$ when the agent is in state j is thus given by
with the first sum running over all $N_{pop}$ presynaptic neurons encoding state j, and the second sum over all presynaptic firing times $t^f_k$ of neuron k that occurred before t. The excitatory postsynaptic current κ is modeled as an exponential decay, $\kappa(x) = \epsilon_0 e^{-x/\tau_m}$ for $x \ge 0$ and zero otherwise. Each CA1 neuron i fires following an inhomogeneous Poisson process with rate $\rho^{post}_i(t)$. Note that, in most simulations, we use a single neuron per population, $N_{pop} = 1$. In addition, we normally set $t^* = \theta$ and $\omega = T - \theta$. However, we keep these as explicit parameters for theoretical purposes.
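The activation scheme described in this section can be sketched for a single presynaptic and a single postsynaptic neuron. The code below is a simplified illustration, not the simulation code of the paper: it samples presynaptic Poisson spikes during the window of length θ and evaluates a postsynaptic rate made of the bias current plus exponentially decaying, weight-scaled inputs. The value θ = 80 ms, the weight, ε0, τm and ρ_bias are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def sample_presynaptic_spikes(rho_pre, theta, rng):
    """Homogeneous Poisson spike times at rate rho_pre on the window [0, theta)."""
    n_spikes = rng.poisson(rho_pre * theta)
    return np.sort(rng.uniform(0.0, theta, size=n_spikes))

def postsynaptic_rate(t, spike_times, w, eps0, tau_m, rho_bias, t_star, omega):
    """Bias current inside [t_star, t_star + omega) plus weighted, decaying presynaptic inputs."""
    past = spike_times[spike_times <= t]  # only spikes that occurred before t contribute
    synaptic = w * eps0 * np.sum(np.exp(-(t - past) / tau_m))
    bias = rho_bias if t_star <= t < t_star + omega else 0.0
    return bias + synaptic

rng = np.random.default_rng(0)
theta, T = 80.0, 100.0  # ms; theta is an assumed value, T = 100 ms as used for the linear track
spikes = sample_presynaptic_spikes(rho_pre=0.1, theta=theta, rng=rng)
# Rate of the postsynaptic neuron encoding the current state, 10 ms into the bias window,
# using t_star = theta and omega = T - theta.
rate_at_90 = postsynaptic_rate(90.0, spikes, w=1.0, eps0=0.01, tau_m=2.0,
                               rho_bias=0.1, t_star=theta, omega=T - theta)
print(len(spikes), rate_at_90)
```

A postsynaptic spike train could then be drawn from this rate as an inhomogeneous Poisson process, for instance by thinning; that step is omitted here to keep the sketch short.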

Equivalence with TD(λ)

Total plasticity update
Since we have the mathematical equation for the plasticity rule, and CA3 and CA1 neurons follow an inhomogeneous Poisson process with time-dependent firing rate, we can analytically calculate the average total weight change for the synapse $w_{ij}$, given a certain trajectory (details in the Appendix). Note that our calculation is based on Kempter et al., 1999, which takes into account the fact that our plasticity rule is sensitive to spike timing and involves a spike-spike correlation term. We find that: where N is the number of states until the end of the trajectory and

Comparison with TD( λ )
Comparing the total weight change due to STDP (Equation 13) to the TD(λ) update (Equation 8), we can see that the two equations are very similar in form: We impose $w_{ij} = M(j, i)$, and find: where A, B, B′ and C are defined as in Equations 14, 15, and 16. Hence, our plasticity rule is learning the successor representation through a TD(λ) model with parameters: To ensure the learning rate η is positive, one condition resulting from Equation 21 is

Learning during normal behavior ($\theta \gg \tau_{LTP}$)
During normal behavior, we assume the place-tuned currents are on larger timescales than the plasticity constants: $\theta, \omega \gg \tau_{LTP}$. We can see from Equations 14 and 16 that the factor A grows linearly with θ while C grows exponentially with θ. From Equation 23, we then have (see also Figure 2-figure supplement 1).

Learning during replays ($\theta \ll \tau_{LTP}$)

Assumptions
For the replay model we assume the place-tuned currents are impulses, which make the neurons emit exactly one spike at a given time. Specifically, we can make the duration of the place-tuned currents go to 0, while the intensity of the currents goes to infinity. For simplicity, we will take: Furthermore, we assume that the contribution of the postsynaptic currents due to the single presynaptic spikes is negligible in terms of driving plasticity, allowing us to set $\epsilon_0 \to 0$.

Calculations of TD parameters
Given the assumptions above, we can see from Equations 14 and 16 that: For Equation 15, we can use the Taylor expansion of $e^{x/\tau}$ around $x = 0$, such that: Using Equations 21, 22, 23 and 18, we can calculate the parameters and constraints for the TD model: As expected, the bootstrapping parameter is λ = 1 (see also Figure 2-figure supplement 1).
Alternative derivation of replay model

Place cell activation during replays
We model a replay event as a precise temporal sequence of spikes. Since every neuron represents a state in the environment, a replay sequence reproduces a trajectory of states. We assume that, when the agent is in state S = j, the neurons representing state j fire $n_{pre}$ spikes at some point in the time interval $t \in [0, \sigma]$, where the exact firing times are uniformly sampled. After a short delay, the CA1 neurons representing state j fire $n_{post}$ spikes at a time uniformly sampled from the interval $[t^*, t^* + \sigma]$. The time between two consecutive state visits is T. The exact number of spikes in each replay event is random but small. Specifically, it is sampled from the set {0, 1, 2} according to the probability vector
It is worth noting here that other implementations are possible, but that we assume the average number of spikes in each state is 1, and that the average time between a presynaptic and a postsynaptic spike is $t^*$. The model could be further generalized to a higher average number of spikes per state.

Plasticity update
We can consider again our learning rule, composed of a positive pre-post potentiation window and presynaptic weight-dependent depression (Equation 9). Consider the synapse $w_{ij}$. On average, the total amount of depression will be determined by the number of times the state j is visited in the replayed trajectory: where $N_j$ is the number of times the state j is visited. The amount of potentiation will instead be determined by the time difference between the postsynaptic and presynaptic firing times, which encodes the distance between state j and state i: where $n^{ij}_k$ represents the number of times the agent visited state i k steps after j. Combining the equations above we find that: If we assume that this value has converged to its stationary state, $\Delta w_{ij} = 0$, we find that the stable weight is: which is the definition of the successor representation matrix (Equation 4). Indeed, $w^{\star}_{ij}$ is computing the sample mean of the discounted distance between states i and j, which is equivalent to performing an every-visit Monte Carlo or TD(λ = 1) update. Notably, from Equation 29, we have that the learning rate for the Monte Carlo update is given by:

Simulation details for Figure 2
A linear track with four states is simulated. The policy of the agent in this simulation is to traverse the track from left to right, with one epoch consisting of starting in state 1 and ending in state 4. One simulation consists of 50 epochs, and we re-run the whole simulation ten times with different random seeds. Over these ten seeds, the mean and standard deviation of the synaptic weights are recorded after every epoch. Our neural network consists of two layers, each with a single neuron per state (as in Figure 1). Synaptic connections are made from each presynaptic neuron to all postsynaptic neurons, resulting in a 4-by-4 matrix which is initialized as the identity matrix. The plasticity rule and neuronal activations follow Equations 9-12.
The STDP parameters are listed in Table 1.
In the replay case, we have a sequence with a single spike per neuron (see Figure 2b and section 'Alternative derivation of replay model'). Following Equation 27, we choose $T = -\log(\gamma)\,\tau_{LTP} \approx 7$ ms, where γ and $\tau_{LTP}$ are the same as in Table 1. We set θ = 2 ms and σ = 0.5 ms. By setting η stdp = η A LTP exp(θ/τ LTP ), the corresponding TD(λ) parameters are λ = 1, γ = 0.89, η = 0.12, just as in the behavioral case. More details on the place cell activation during replays in our model can be found in section 'Alternative derivation of replay model'. Using exactly one single spike per neuron with the above parameters would allow us to follow the TD(1) learning trajectories without any noise. For more biological realism, we choose $p_1 = 0.15$ in Equation 28, in order to achieve an amount of noise due to the random spiking equal to that in the case of behavioral activity (see Figure 4-figure supplement 2).

Simulation details for Figure 3
Using the same neural network and plasticity parameters as the behavioral learning in Figure 2 (see previous section), we simulate the linear track in the following two situations:
• The third state has T = 200 ms instead of 100 ms. All other parameters remain the same as in Figure 2. Results plotted in Figure 3E.
• The third state has $\rho^{pre}$ = 0.2 ms⁻¹ instead of 0.1 ms⁻¹. All other parameters remain the same as in Figure 2. Results plotted in Figure 3F.
Simulation details for Figure 4
A linear track with three states is simulated, and the agent has 50% probability to move left or right in each state (see Figure 4A). One epoch lasts until the agent reaches one of the STOP locations. We then use the same neural network and plasticity parameters as used for Figure 2. We simulate three scenarios:
• Only replay-based learning during all epochs (no behavioral learning). This scenario corresponds to MC STDP in Figure 4B and to Figure 4C.
• Mixed learning using both behavior and replays. The probability for an epoch to be a replay decays over time following exp(−i/6), with i the epoch number. This scenario corresponds to Mix STDP in Figure 4B and to Figure 4E.
• Only behavioral learning during all epochs (no replays). This scenario corresponds to TD STDP in Figure 4B and to Figure 4D.
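The mixed-learning schedule in the second scenario can be sketched as follows. This is a minimal illustration, assuming that epochs are indexed from 0 (so that the first epoch is always a replay) and that each epoch is labeled by an independent random draw; neither detail is specified above, and the function names are hypothetical.

```python
import numpy as np

def replay_probability(epoch, decay=6.0):
    """Probability that a given epoch is a replay, decaying as exp(-epoch / decay)."""
    return float(np.exp(-epoch / decay))

def sample_epoch_types(n_epochs, rng, decay=6.0):
    """Label each epoch as 'replay' or 'behavior'; epochs are indexed from 0 (an assumption)."""
    return ["replay" if rng.random() < replay_probability(i, decay) else "behavior"
            for i in range(n_epochs)]

rng = np.random.default_rng(1)
schedule = sample_epoch_types(50, rng)
print(schedule[:10], schedule.count("replay"))
```

With this decay constant, replays dominate the first handful of epochs and become rare after roughly twenty epochs, so early learning is Monte Carlo-like and later learning relies on behavioral TD updates.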
Simulation details for Figure 5
A linear track with 21 states is simulated. The SR is initialized as the identity matrix, and the reward vector (containing the reward at each state) is initialized as the zero vector. We simulate the learning of the SR during behavior using the theoretical TD(0) updates and during replays using the theoretical TD(1) updates. The value of each state is then calculated as the matrix-vector product between the SR and the reward vector, resulting in an initial value of zero for each state. The policy of the agent is a softmax policy (i.e. the probability to move to neighboring states is equal to the softmax of the values of those neighboring states). The first time the agent reaches the leftmost state of the track (state 1), the negative reward of -2 is revealed, mimicking the shock in the actual experiments, and the reward vector is updated accordingly for this state.
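The softmax policy over neighboring state values can be sketched as below. This is an illustration under assumptions: the softmax temperature is taken to be 1, states are indexed from 0 (so the leftmost state is index 0), and end states simply have a single neighbor; none of these details is specified above, and the function name is hypothetical.

```python
import numpy as np

def softmax_step_probabilities(values, state, temperature=1.0):
    """Neighbors of `state` on a linear track and the softmax probabilities of moving to each."""
    n_states = len(values)
    neighbors = [s for s in (state - 1, state + 1) if 0 <= s < n_states]
    v = np.array([values[s] for s in neighbors]) / temperature
    p = np.exp(v - v.max())  # subtract the maximum for numerical stability
    return neighbors, p / p.sum()

n_states = 21
M = np.eye(n_states)          # SR initialized as the identity matrix
reward = np.zeros(n_states)   # reward vector initialized as the zero vector
_, p_initial = softmax_step_probabilities(M @ reward, state=1)

reward[0] = -2.0              # negative reward revealed at the leftmost state
neighbors, p_after = softmax_step_probabilities(M @ reward, state=1)
print(neighbors, np.round(p_after, 4))
```

Before the reward is revealed all values are zero and both directions are equally likely; afterwards the agent next to the leftmost state moves away from it with higher probability, and the bias spreads further along the track only as the SR itself is learned.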
We now simulate two scenarios: in the first scenario, the agent always follows the softmax policy and no replays are triggered (see Figure 5D, left panel). In the second scenario, every time the agent enters the dark zone from the light zone (i.e. transitions from state 12 to state 11 in our simulation), a replay is triggered from that state until the leftmost state (state 1) (see Figure 5D, right panel). Both scenarios are simulated for 2000 state transitions. We then run these two scenarios 100 times and calculate the mean and standard deviation of state occupancies (Figure 5F).
Finally, since the second scenario has more SR updates than the first scenario, we also simulate the first scenario for 4000 state transitions (Figure 5-figure supplement 1) and show how the observed behavior of Figure 5 is unaffected by this.

Data availability
The current manuscript is a computational study, so no data have been generated for this manuscript.

Appendix 1
Analytical derivations for the total weight change in the behavioural model

Presynaptic rate during state j
Whenever a state S = j is entered, the presynaptic neurons encoding state j start firing at a constant rate $\rho^{pre}$ for a time θ, following a Poisson process with parameter $\rho^{pre}_j(t)$: The other presynaptic neurons are silent.

Postsynaptic rate during state j
The average postsynaptic rate can be calculated as follows. The probability of a presynaptic spike between t and t + dt is equal to $\rho^{pre}_j(t)\,dt$. The size of the presynaptic population encoding state j is equal to $N_{pop}$ and each excitatory postsynaptic potential (EPSP) is modeled by an immediate jump with amplitude $\epsilon_0 w_{ij}$, followed by an exponential decay with time constant $\tau_m$. Following Equation 12 in the main paper, reproduced below, we find that the average postsynaptic potential at time t is given by (assuming t = 0 when entering the state j): We assume that $w_{ij}(t)$ changes slowly compared to the timescale θ, allowing us to consider the weight constant during that time. We can then approximate the average postsynaptic rate as: If $t^{\star} < \theta$, both the first and the second term will contribute to the postsynaptic rate in the time between $t^{\star}$ and θ.

LTP trace during state j
Given Equation 9 in the main paper, reproduced below, and combined with Equation 35, we can calculate the evolution of the LTP trace for neuron j during state j : For 0 ≤ t < θ , the presynaptic neuron j is active and therefore the trace builds up with the presynaptic spikes, for t ≥ θ , the trace decays exponentially with time constant τ LTP .

Total amount of LTP during state j
Following Kempter et al., 1999, we first calculate the amount of LTP without taking into account spike-to-spike correlations: The probability for a postsynaptic spike between t and t + dt is $\rho^{post}_i(t)\,dt$. The amount of LTP due to a single spike at time t is $A_{LTP}\,Tr^{LTP}_j(t)$. Hence, combining Equations 37 and 38, the total amount of LTP during a state (i.e. between time 0 and T) becomes: Following Kempter et al., 1999, the amount of LTP due to the causal part (each presynaptic spike temporarily increases the probability of a postsynaptic spike) is given by: Combining the equations for the non-causal (Equation 40) and causal (Equation 41) parts, we get the total amount of LTP during a state (assuming $\tau_m \ll \tau_{LTP}$):

$\dots + A_{LTP}\,\theta\,\rho^{pre}\,\frac{\tau_m \tau_{LTP}}{\tau_m + \tau_{LTP}}\,\epsilon_0 w_{ij}$ (42)

Total amount of LTD during state j
There is a weight-dependent depression for each presynaptic spike, hence the amount of LTD during a state is given by:

Total plasticity during state j
Combining Equations 42 and 43, we can calculate the total amount of plasticity during the time the agent spends in the current state j:

Plasticity due to states transitioning
Once the agent leaves state j , the decaying LTP trace can still cause potentiation due to the activity in the following states, j + n , with n = 1, 2, ... . Given that the agent spends a time T in each state, we find that the agent visits state j + n during time t ∈ [nT, nT + T) . We will now calculate the contribution to plasticity due to these state transitions.
Postsynaptic rate during the new state j + n
During state j + n, the activity of the postsynaptic neurons is driven by the presynaptic neurons coding for j + n, and the bias current. We can thus generalize Equation 37 and find that the average postsynaptic rate $\rho^{post}_i$ during state j + n is:

LTP trace from state j, during the new state j + n
Following Equation 38, we find that the amplitude of the LTP trace from state j during state j + n is: with $0 < t' < T$.

LTP due to state transitioning
We can then calculate the amount of LTP between the presynaptic neuron j and the postsynaptic neuron i , when the agent is in state j + n . We refer to Equation 39 and find: The amount of plasticity in state j + n when starting from state j is thus: where It is worth noting that the parameter B derived here is the same as Equation 46.

Summary: total STDP update
If we combine Equations 44 and 49, we find that the total weight change for the synapse $w_{ij}$ is given by:

$\dots \left[ B\,\left(e^{-T/\tau_{LTP}}\right)^{n}\,\delta_{i,j+n} + C\,\left(e^{-T/\tau_{LTP}}\right)^{n+1}\,w_{i,j+n+1} \right]$

where N is the number of states until the end of the trajectory and A, B, C are as defined in Equations 45, 46 and 50 respectively.

Analytical calculations for hyperbolic discounting
From Equation 22 in the main paper, we have that, in the behavioural model, $\gamma = \left(1 - \frac{C}{A}\right) e^{-T/\tau_{LTP}}$. Here, we will derive an approximation to this value.
If we assume that θ >> τm, τ LTP , we can approximate A and C as: If we define ψ such that θ + ψ = T , we can rewrite and approximate the discount parameter as: From Equation 56, we can see that the discount γ follows a hyperbolic function if we increase the duration of the presynaptic current θ . If, instead, we vary ψ , the discount becomes exponential (Figure 2-figure supplement 1a and b).
Note that this analysis extends to the replay model. Following what was done after Equation 26, we can connect the behavioural model with the replay model by taking $\theta, \epsilon_0 \to 0$, which implies $\psi \to T$. From Equation 56 we find that: which is exactly the definition of γ in the replay model (Equation 27 in Materials and methods). For replays, the discount is therefore strictly exponential.
Furthermore, using the same calculations and Equations 21 and 19 in the main paper, we can find approximated values for the other parameters too (Figure 2-figure supplement 1c and d).